1. INTRODUCTION

Advances in technology allow computers to produce huge amounts of data. Likewise, advances and
innovations in medical database management systems generate large volumes of medical data. The
healthcare industry holds very large and sensitive datasets, which must be handled carefully to derive
benefit from them. Diabetes mellitus is a group of related diseases in which the human body is unable to
regulate the quantity of sugar in the blood. The result is a high blood sugar level, either because the body
does not produce sufficient insulin or because cells do not respond to the insulin that is produced.
The focus of this work is to develop prediction models using selected machine learning algorithms.
Machine Learning (ML) is an application of artificial intelligence that enables a computer to learn from
data. The two main categories of ML are supervised and unsupervised learning. Supervised learning
algorithms use past, labeled experience to make predictions on new or unseen data, while unsupervised
algorithms draw inferences from unlabeled datasets. The machine learning algorithms considered are:


Supervised learning techniques: 
Classification 

Classification is the process of predicting the unknown class label of new data using data whose
class labels are already known. Popular classification algorithms include:
— Random forest
— SVM
— K-Nearest neighbors
— Decision tree
— Naive Bayes


Regression 
Regression is a supervised learning technique, like classification, that finds the relationship
between one or more independent variables and a dependent variable. Popular regression
algorithms include:
— Simple Linear Regression 
— Multiple Linear Regression 
— Logistic Regression 
— Polynomial Regression 
— Linear Discriminant Analysis (LDA) 


Unsupervised Learning techniques: 
Clustering 
Clustering is the process of grouping similar objects together. Some clustering techniques are:
—  K-means clustering 
— Hierarchical clustering 


R studio 

R Studio is an Integrated Development Environment (IDE) for the R programming language, founded
by J. J. Allaire. It provides a console that runs the R interpreter. R Studio is used for statistical
computing and graphics, and its many built-in packages allow it to manipulate large datasets for
analysis.


2. LITERATURE REVIEW 

Big data has been used to predict diabetes in many studies. Table 1 lists studies in the field
together with a critique of each paper. Taking the critique and notes on each published study into
account, this research proposes a new model that addresses problems in previous work.


Table 1. Review of related research

No. 1
  Paper: Predicting Diabetes in Medical Datasets Using Machine Learning Techniques
  Author(s): Uswa Ali Zia, Dr. Naeem Khan
  Name of the journal: International Journal of Scientific & Engineering Research (IJSER)
  Methods: Bootstrapping resampling technique to enhance the accuracy, then applying
    i. Naive Bayes, ii. Decision Trees, iii. k-Nearest Neighbors (k-NN)
  Findings: Accuracy after bootstrapping: i. Naive Bayes 74.89%; ii. Decision Trees 94.44%;
    iii. k-NN (for k=1) 93.79%; iv. k-NN (for k=3) 76.79%
  Notes/Critique: i. Plan to use more advanced classifiers such as Neural Networks.
    ii. It should consider some other important factors that are related to gestational diabetes,
    like metabolic syndrome, family history, habit of smoking, lazy routines, some dietary patterns etc.

No. 2
  Paper: Prediction of Diabetes Using Data Mining Techniques
  Author(s): Fikirte Girma Woldemichael, Sumitra Menaria
  Name of the journal: International Conference on Trends in Electronics and Informatics (ICOEI)
  Methods: i. Back Propagation Algorithm, ii. J48 Algorithm, iii. Naïve Bayes Classifier,
    iv. Support Vector Machine
  Findings: Back Propagation Algorithm: accuracy 83.11%, sensitivity 86.53%, specificity 76%
  Notes/Critique: i. Improve the accuracy of the algorithms.

No. 3
  Paper: Diabetes Disease Prediction Using Data Mining
  Author(s): Deeraj Shetty, Kishor Rit, Sohail Shaikh, Nikita Patil
  Name of the journal: International Conference on Innovations in Information, Embedded and
    Communication Systems (ICIIECS)
  Methods: i. Naive Bayes, ii. k-NN algorithms
  Findings: Prediction of the disease is done with the help of the Bayesian algorithm and the KNN
    algorithm, which are analyzed by taking various attributes of diabetes.
  Notes/Critique: i. Improve the accuracy of the algorithms. ii. Work on some more attributes
    that can be used to tackle diabetes even more.

No. 4
  Paper: Classification of Diabetic Patients by using Efficient Prediction from Big Data using R Studio
  Author(s): K. Sharmila, Dr. S.A. Vetha Manickam
  Name of the journal: International Journal of Advanced Engineering Research and Science (IJAERS)
  Methods: Decision tree
  Findings: i. Using R, the dataset is analyzed and the correlation coefficient for two attributes
    is calculated. ii. Decision Tree is used to predict the type of diabetes.
  Notes/Critique: Possibility of developing efficient predictive models using the information
    from the analysis that has already been carried out.

No. 5
  Paper: Diagnosis of Diabetes using Classification Mining Techniques
  Author(s): Aiswarya Iyer, S. Jeyalatha and Ronak Sumbaly
  Name of the journal: International Journal of Data Mining & Knowledge Management Process (IJDKP)
  Methods: Decision tree J48, Naïve Bayes
  Findings: J48 cross validation 74.8698%; J48 percentage split 76.9565%; Naive Bayes 79.5652%
  Notes/Critique: i. Future work is planned to gather information from different locales over
    the world. ii. This work can be improved and extended for the automation of diabetes analysis.

No. 6
  Paper: A Disease Diagnosis using Data Mining Techniques and Empirical Study
  Author(s): M. Deepika, Dr. K. Kalaiselvi
  Name of the journal: The 2nd International Conference on Inventive Communication and
    Computational Technologies (ICICCT)
  Methods: i. Artificial Neural Network, ii. Decision Tree, iii. Logistic Regression,
    iv. Naïve Bayes, v. SVM
  Findings: Artificial Neural Network 73.23%; Logistic Regression 76.13%; Decision Tree 77.87%
  Notes/Critique: An efficient and accurate classifier can be developed.




3. PROPOSED SYSTEM 

We propose a classification model with boosted accuracy to predict whether a patient is diabetic.
In this model, we employ different machine learning techniques: classification, regression and
clustering. The major focus is to increase accuracy by applying a resampling technique to the well-known
benchmark PIMA diabetes dataset, acquired from the UCI machine learning repository, which has eight
attributes and one class label. The proposed framework is shown in Figure 1. Each phase is described
below.


3.1. Data selection 


Data selection is the process of selecting the most relevant data from a specific domain to
derive values that are informative and facilitate learning. The PIMA diabetes dataset has 8 attributes
that are used to predict diabetes at an early stage. This dataset is obtained from the UCI repository.


3.2. Data pre-processing 
Data pre-processing is a machine learning step that transforms raw data into a usable format.
It includes Data Cleaning, Data Integration, Data Transformation, and Data Discretization.


3.3. Feature extraction through principal component analysis

Feature extraction is performed on the dataset to determine the most suitable set of attributes for
achieving better classification. The set of attributes suggested by Principal Component Analysis (PCA)
is termed the feature vector. Feature reduction, or dimensionality reduction, benefits us by reducing
computation and space complexity.
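As a rough illustration of how PCA finds a feature direction, the sketch below (plain Python rather than
the R used in the study) computes the covariance matrix of a small made-up two-attribute table and
approximates its first principal component by power iteration. The data values, the function names and
the choice of power iteration are assumptions for illustration; the paper does not state how PCA was
computed.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: approximates the unit eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0] * d  # arbitrary starting vector (assumption)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: the second attribute is exactly twice the first.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
cov = covariance_matrix(rows)
component = first_component(cov)
print(component)  # direction proportional to (1, 2)
```

Projecting each row onto this direction would replace the two correlated attributes with a single feature,
which is the sense in which dimensionality reduction lowers computation and storage.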
3.4. Resampling Filter

The supervised Resample filter is applied to the pre-processed dataset. Resampling is a family of
methods that rebuild sample datasets, including training sets and validation sets, by drawing repeatedly
from the original data. In this study, the bootstrapping resampling technique is used to enhance the
accuracy.
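A minimal sketch of bootstrap resampling, written in Python rather than the R used in the study: a new
sample of the same size is drawn from the dataset with replacement, so some records repeat and others
are left out. The toy records and the function name are hypothetical and are not taken from the PIMA data.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

# Hypothetical toy records: (glucose, BMI, class label).
rows = [(148, 33.6, 1), (85, 26.6, 0), (183, 23.3, 1), (89, 28.1, 0), (137, 43.1, 1)]

rng = random.Random(42)  # fixed seed so the resample is reproducible
resampled = bootstrap_sample(rows, rng)
print(resampled)
```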


[Figure 1: flowchart of the proposed system. Steps, in order: Loading Data; PCA (Dimensionality
Reduction) producing the feature vector; Classification (RF, Decision Tree, SVM, Naive Bayes and K-NN),
Regression (SLR, MLR, NLR, Logistic Regression and LDA) and Clustering (K-means and AGC); Evaluating
Model Accuracy; Model Evaluation; decision "Has desired accuracy been met?"; Predict the diabetes
disease; End.]

Figure 1. Proposed system for diabetes prediction


4. MACHINE LEARNING TECHNIQUES 
4.1. Classification 
4.1.1. Random forest 

Random forest is an ensemble learning technique for classification and regression. It operates by
constructing a multitude of decision trees at training time and outputting the class that is the mode
of the classes predicted by the individual trees (classification) or the mean of their predictions
(regression). Random decision forests correct for the tendency of decision trees to overfit their
training set.
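The aggregation step described above can be sketched in a few lines of Python (not the R used in the
study). The sketch covers only how the outputs of already-trained trees are combined; the tree predictions
and function names are hypothetical.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Return the mode (most common class) of the individual tree predictions."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Return the mean of the individual tree outputs."""
    return mean(tree_outputs)

# Hypothetical predictions from five trees for one patient.
votes = ["diabetic", "non-diabetic", "diabetic", "diabetic", "non-diabetic"]
print(forest_classify(votes))           # diabetic
print(forest_regress([0.7, 0.4, 0.9]))  # average of three hypothetical tree outputs
```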
4.1.2. Support vector machine (SVM)

SVM is a supervised learning algorithm used for classification, regression and outlier detection.
An SVM separates the data with a hyperplane. The hyperplane should separate the two classes as cleanly
as possible, and the maximum-margin hyperplane is chosen as the best separator. Two types of SVM
classifier are used: the linear classifier and the non-linear classifier.
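To make the hyperplane and margin concrete, the Python sketch below (the study itself used R) evaluates
a linear decision function w.x + b and the margin width 2/||w|| for one fixed hyperplane. The weights,
the bias and the test points are hypothetical; training, which would choose w and b to maximize the
margin, is not shown.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the two supporting hyperplanes w.x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = [3.0, 4.0], -12.0  # hypothetical hyperplane 3*x1 + 4*x2 - 12 = 0
print(decision_value(w, b, [4.0, 4.0]))  # 16.0, positive side
print(decision_value(w, b, [1.0, 1.0]))  # -5.0, negative side
print(margin_width(w))                   # 0.4
```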


4.1.3. Decision tree 

A decision tree algorithm builds a classification or regression model from training data in the form
of a tree structure. It uses previous data to classify or predict the class or target variable of new
data with the help of decision rules. Decision trees can handle both numerical and categorical data.
In a complete decision tree, the root node, and every internal node below it, tests the best splitting
attribute at that position. Each outcome of the test creates a branch. A leaf node acts as the final
class label or target value used to classify or predict new data. Classification rules are read off
along the paths from root to leaf.
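One common way to pick the "best splitting" test at a node is to minimize Gini impurity. The Python
sketch below (not the R used in the study) searches for the best threshold on a single numeric attribute.
Gini impurity is an assumption here, since the paper does not name its split criterion, and the
(glucose, label) pairs are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(rows, threshold):
    """Weighted Gini impurity after splitting (value, label) rows at value <= threshold."""
    left = [label for value, label in rows if value <= threshold]
    right = [label for value, label in rows if value > threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(rows):
    """Try every observed value as a threshold and keep the one with the lowest impurity."""
    candidates = sorted({value for value, _ in rows})
    return min(candidates, key=lambda t: split_gini(rows, t))

# Hypothetical (glucose, class label) pairs.
rows = [(85, 0), (89, 0), (110, 0), (137, 1), (148, 1), (183, 1)]
print(best_threshold(rows))  # 110
```

A full tree would repeat this search over all attributes at each node and recurse on the two branches.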


4.1.4. Naïve bayes 

Naive Bayes is an algorithm that performs classification tasks in ML. It performs well on binary and
multi-class classification problems, even on datasets with a very large number of records. Naive Bayes
is applied mainly in text analysis and Natural Language Processing. It is based on conditional
probability, as given by Bayes' theorem in (1).


$$P(M \mid N) = \frac{P(N \mid M)\,P(M)}{P(N)} \qquad (1)$$


Here M and N are two events, P(M|N) is the conditional probability of M given N, P(M) is
the probability of M, P(N) is the probability of N, and P(N|M) is the conditional probability of N
given M.
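A small worked example of Bayes' theorem in Python (the study used R), with M read as "patient is
diabetic" and N as "glucose reading is high". All probabilities are invented for illustration, and P(N)
is obtained from the law of total probability.

```python
def posterior(p_n_given_m, p_m, p_n):
    """Bayes' theorem: P(M|N) = P(N|M) * P(M) / P(N)."""
    return p_n_given_m * p_m / p_n

# Hypothetical probabilities, not estimated from any dataset.
p_m = 0.35              # P(M): prior probability of diabetes
p_n_given_m = 0.80      # P(N|M): high glucose among diabetic patients
p_n_given_not_m = 0.20  # P(N|not M): high glucose among non-diabetic patients

# Law of total probability gives P(N).
p_n = p_n_given_m * p_m + p_n_given_not_m * (1 - p_m)

p_m_given_n = posterior(p_n_given_m, p_m, p_n)
print(round(p_m_given_n, 3))
```

A Naive Bayes classifier applies the same rule with several attributes at once, treating them as
conditionally independent given the class.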


4.1.5. K-nearest neighbors 

k-Nearest Neighbor (k-NN) is a supervised classifier. To predict the target label of a test point,
k-NN computes the distance between the new test point and the training points, selects the K nearest
training points, and assigns the class label that is most common among them. K is typically a small
positive integer, commonly between 1 and 10.
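The prediction rule can be sketched as follows in Python (rather than the R used in the study). Euclidean
distance is assumed as the distance measure, and the training points are hypothetical (glucose, BMI)
pairs with made-up labels.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training points nearest to query (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical training set: ((glucose, BMI), class label).
train = [
    ((85.0, 26.6), 0),
    ((89.0, 28.1), 0),
    ((148.0, 33.6), 1),
    ((183.0, 23.3), 1),
    ((137.0, 43.1), 1),
]

print(knn_predict(train, (140.0, 35.0), k=3))  # 1
print(knn_predict(train, (90.0, 27.0), k=3))   # 0
```

In practice the attributes are usually scaled first, since an attribute with a large numeric range would
otherwise dominate the distance.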


4.2. Regression 
4.2.1. Simple linear regression 

Simple linear regression explains the relationship between an independent variable and a dependent
variable in order to predict the values of the dependent variable. It uses exactly one independent
variable. The simple linear regression model is given in (2).


$$y = b_0 + b_1 x \qquad (2)$$


Here x (the independent variable) and y (the dependent variable) are the two variables involved in
simple linear regression analysis, $b_0$ is the y-intercept and $b_1$ is the slope.
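The intercept and slope in (2) can be estimated by ordinary least squares. The Python sketch below (the
study used R) uses the closed-form estimates; least squares is assumed as the fitting method and the
data points are made up so that they lie exactly on y = 1 + 2x.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx           # slope
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # hypothetical points on y = 1 + 2x
b0, b1 = fit_simple_linear(xs, ys)
print(b0, b1)  # 1.0 2.0
```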


4.2.2. Multiple linear regression

Multiple linear regression explains the relationship between two or more independent variables and
a dependent variable in order to predict the values of the dependent variable. The dependent variable
takes continuous values, while the independent variables may take discrete or continuous values.
The multiple linear regression model is given in (3).


$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n \qquad (3)$$


Here $x_1, x_2, \ldots, x_n$ (the independent variables) and y (the dependent variable) are the
variables involved in multiple linear regression analysis, $b_0$ is the y-intercept and
$b_1, b_2, \ldots, b_n$ are the slope coefficients.
4.2.3. Logistic regression

Logistic regression is a predictive analysis used when the dependent variable is categorical.
It explains the relationship between one dependent variable and one or more independent variables.
The types of logistic regression are:
— Binary Logistic Regression (two outcomes)
— Multinomial Logistic Regression (three or more outcomes, unordered)
— Ordinal Logistic Regression (three or more outcomes, ordered)

In binary logistic regression the categorical response has only two possible outcomes. Multinomial
logistic regression has three or more outcomes without ordering, whereas ordinal logistic regression
has three or more outcomes with ordering.
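For the binary case, the model passes a linear combination of the independent variables through the
logistic (sigmoid) function to obtain a probability. A minimal Python sketch follows (the study used R);
the coefficients, the intercept and the 0.5 decision threshold are hypothetical and are not fitted to
any data.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(coeffs, intercept, x):
    """Probability of the positive class for one observation x."""
    z = intercept + sum(c * xi for c, xi in zip(coeffs, x))
    return sigmoid(z)

# Hypothetical coefficients for (glucose, BMI).
coeffs, intercept = [0.03, 0.05], -5.0
p = predict_proba(coeffs, intercept, [150.0, 30.0])
label = 1 if p >= 0.5 else 0  # assumed decision threshold of 0.5
print(round(p, 3), label)
```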


4.2.4. Polynomial regression 

Polynomial regression is a form of regression analysis that models the relationship between the
independent variable and the dependent variable as an nth-degree polynomial. It fits a non-linear
relationship between the value of the independent variable and the conditional mean of the dependent
variable. It is represented as (4).


$$p = a + b\,q^{n} \qquad (4)$$


Here p is the dependent variable, q is the independent variable, n is the degree, and a and b are
the fitted coefficients.
Polynomial regression is useful when the data points curve above and below a straight regression line.
Fitting minimizes the cost function, which gives the optimum result for the regression.


4.2.5. Linear discriminant analysis 

Linear Discriminant Analysis (LDA) applies linear discriminant functions to a set of data items in
order to separate and analyze the classes of objects they belong to. LDA is used in image recognition
and predictive analytics.


4.3. Clustering 
4.3.1. K-means clustering 

K-means is an unsupervised machine learning algorithm that solves clustering problems by
partitioning the dataset into k clusters (groups of similar objects). The number of clusters k is fixed
before the dataset is partitioned.
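A compact Python sketch of the usual k-means iteration (the study used R): assign each point to its
nearest centroid, then move each centroid to the mean of its points. Taking the first k points as the
initial centroids and running a fixed number of iterations are simplifying assumptions, and the points
are made up.

```python
import math

def kmeans(points, k, iterations=10):
    """Return (centroids, clusters) after a fixed number of assign/update rounds."""
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```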


4.3.2. Hierarchical clustering 
Hierarchical clustering is a type of clustering algorithm that builds a hierarchy of clusters.
The two types of hierarchical clustering are agglomerative clustering and divisive analysis.


4.3.3. Agglomerative clustering 
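Agglomerative clustering is a bottom-up approach that groups objects into clusters based on their
similarity: each object starts in its own cluster and the most similar clusters are merged step by
step. The final result is a tree representation of the objects called a dendrogram.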
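The merging procedure can be sketched in Python as follows (the study used R). Single linkage, meaning
the smallest distance between any two members, is assumed as the similarity measure, and the
one-dimensional values are made up. The recorded merge order is the information a dendrogram displays.

```python
def single_linkage(a, b):
    """Smallest absolute difference between a member of a and a member of b."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(values, target_clusters):
    """Merge the two closest clusters repeatedly until target_clusters remain."""
    clusters = [[v] for v in values]  # every value starts in its own cluster
    merges = []
    while len(clusters) > target_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

values = [1.0, 1.2, 5.0, 5.1, 9.0]  # hypothetical measurements
clusters, merges = agglomerate(values, target_clusters=3)
print(clusters)
print(merges)
```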


4.3.4. Divisive analysis 

This is a top-down approach in which all observations start in one cluster, and splits are performed
recursively as one moves down the hierarchy. A hierarchical clustering is often represented as
a dendrogram. Each cluster is represented by its centroid, and the distance between clusters is
calculated using a linkage criterion.


5. RESULTS AND ANALYSIS 

The PIMA Indian diabetes dataset was used for the analysis in this study. It consists of eight
independent attributes and one dependent class attribute. The study was implemented in the R
programming language using R Studio. Machine learning algorithms for classification (Decision Tree,
SVM, Naive Bayes, k-NN and Random Forest), regression (linear, multiple, logistic, LDA) and clustering
(k-means, hierarchical agglomerative) are used to predict diabetes in its early stages, as shown in
Table 2. Model performance is measured by accuracy, as shown in Figure 2.

Table 2. Predictive analysis of machine learning algorithms

S. No   Algorithm                      Accuracy
1.      Random forest                  83%
2.      Decision tree                  711%
3.      SVM                            92%
4.      Naive Bayes                    86%
5.      K-NN                           91%
6.      Simple linear regression       98%
7.      Logistic regression            88%
8.      LDA                            88%
9.      k-Means                        81%
10.     Hierarchical agglomerative     74%

[Figure 2: chart comparing the accuracy of the algorithms listed in Table 2.]

Figure 2. Comparison of accuracy of various algorithms
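For reference, accuracy is the share of predictions that match the true class label. The Python sketch
below (the study used R) computes it from confusion-matrix counts, along with the sensitivity and
specificity figures quoted in the literature review. The label vectors are made up and do not reproduce
any result reported in this paper.

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(actual, predicted):
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(actual, predicted):
    tp, _, _, fn = confusion_counts(actual, predicted)
    return tp / (tp + fn)  # share of true positives that were detected

def specificity(actual, predicted):
    _, tn, fp, _ = confusion_counts(actual, predicted)
    return tn / (tn + fp)  # share of true negatives that were detected

# Hypothetical labels for eight patients.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
print(accuracy(actual, predicted))     # 0.75
print(sensitivity(actual, predicted))  # 0.75
print(specificity(actual, predicted))  # 0.75
```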


6. CONCLUSION AND FUTURE WORK

Deep learning and data mining play an important role in various fields such as Artificial
Intelligence (AI), Machine Learning (ML), database systems and more. The core objective of this work
is to enhance the accuracy of the predictive model. On the PIMA dataset, the proposed approach
increases the accuracy of almost all algorithms, with SVM and linear regression leading the others.
In future, more advanced deep learning techniques will be used to increase the accuracy of the
algorithms.